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covered in detail here. Currently, the NCBI's BLAST algorithm (stand- 


alone) is unable to handle biological data in the terabytes. To address this 
Keywords: problem, a variety of schedulers have been proposed. Existing sequencing 
approaches are based on the Hadoop MapReduce (MR) framework, which 


Bioinformatics genomic enables a diverse set of applications and employs a serial execution strategy 


Sequence that takes a long time and consumes a lot of computing resources. The 
Hadoop author, improves the BLAST algorithm based on the BLAST-BSPMR 
Microsoft azure algorithm to achieve the BLAST algorithm. To address the issue with 
Scheduler parallelization Hadoop's MapReduce framework, a customised MapReduce framework is 
developed on the Azure cloud platform. The experiment findings indicate 
that the suggested bulk synchronous parallel MapReduce-basic local 
alignment search tool (BSPMR-BLAST) algorithm matches bioinformatics 
genomic sequences more quickly than the existing Hadoop-BLAST method, 
and that the proposed customised scheduler is extremely stable and scalable. 
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1. INTRODUCTION 

According to research [1], [2], cloud computing has established itself as the future norm for data- 
intensive computing. According to [3], [4] cloud computing platforms allow on-demand access to shared, 
scalable, fault-tolerant, and reconfigurable computing resources with little administrative work and a low 
cost. In comparison to freestanding private computer clusters [5], [6] demonstrate how cloud computing 
platforms are a commonly desired and recognised method for running large data applications or high- 
performance computations (HPC). The cloud's resource management, virtual computing platforms, and 
elasticity all contribute to the ease with which data-intensive applications can be migrated to the cloud 
paradigm. Cai et al. [7], Okur and Biiyiikkegeci [8] note that the frameworks required to enable the most 
efficient use of cloud resources at the lowest possible cost for data-intensive applications remain unresolved. 
Google’s MapReduce framework discussed in [9] is a commonly used approach for performing distributed 
computations on the cloud. White [10] and Hadoop [11] present that implementations of Hadoop are based on 
the MapReduce architecture and are capable of supporting large-scale data applications. The MapReduce 
model employs a two-stage execution mechanism. The initial stage entails splitting or dividing the data to be 
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processed into little pieces. Each small bit of data is associated with a mapper. The mapper 
outputs (Key|Value) pairs that are arranged according to the Keyvalues. The Reduce workers are provided 
with the sorted values using the (Key|SortedList(Value)) method. The Hadoop distributed file system is 
used to store the results of the reduce workers hadoop distributed file system (HDFS). In most public cloud 
setups, the Map and Reduce workers are virtual machines (VMs). 

The researchers recognised the disadvantages of Hadoop MapReduce and examined several 
improvement strategies. Zhu et al. [12] discusses the use of Hadoop with CUDA to increase processing 
capacity and execution speed. Zhu et al. [12] proposes a research paradigm that facilitates appropriate use of 
graphics processing units (GPU). Dahiphale et al. [13] presents a cloud-based MapReduce model. Pipelining 
is used to improve execution performance and to enable elastic pricing. For scientific applications [14] 
proposed a framework based on parallel tempering called replica exchange statistical temperature molecular 
dynamics. Tang et al. [15] describe a unique dynamic Hadoop slot allocation technique for resolving the 
Hadoop resource provisioning problem that is not optimal. The result reported in Tang et al. [15] demonstrates 
that their technique significantly improves Hadoop's performance when handling many MapReduce jobs. The 
most often utilised technique is to increase execution effectiveness through scheduling practises, as described 
in [16]-[19]. 

Wang et al. [20] presented a method for computing "Imprecise Applications" using MR frameworks. 
Reduce is run after the Map step in predictable MapReduce applications. In the case of inaccurate applications, 
the Reduce stage might be initiated using the Map stage's partial findings. Applications such as word 
frequency statistics and hot-word identification are considered imprecise. Executing inaccurate apps on well- 
known machine learning frameworks like Hadoop introduces latency. Running erroneous apps might result in 
additional expenditures on public cloud systems, where users are charged for all computations and storage 
services consumed. To address this shortcoming incorporated the check phase into the MapReduce framework 
in order to reduce costs and execution latencies [20]. 

In MR frameworks, iterative applications and certain graph-based applications performed poorly. 
Google introduced the Pregel framework for cloud computations in [21] for similar applications. Pregel is built 
on valiant's bulk synchronous parallel (BSP) computation model [22]. Kajdanowicz et al. [23] demonstrate 
that the Pregel framework outperforms MapReduce when calculating graph-based applications. Apart from the 
serial execution mechanism utilised in MR, several concerns remain unresolved, including node failure 
management capabilities, scheduling strategies, and multiway join processes. 

The authors provide a framework for parallel computation MapReduce (PMR) in public cloud 
environments in this article. The MapReduce architecture of the BLAST-BSPMR enables parallel 
computation to accelerate the execution of the BSP model. The BLAST-BSPMR implements Map and 
Reduce workers by utilising Microsoft Azure VM computing environments. Multicore processors enable 
parallel computation in these virtualized computing environments. The BLAST-BSPMR proposal leverages 
this parallel execution feature to drastically lower the Map and Reduce worker nodes' computation times. In 
comparison to other MapReduce frameworks like Hadoop, the BLAST-Reduce BSPMR's phase is initiated 
when two or more worker nodes have completed their jobs. The BLAST-BSPMR function shown here is 
used to calculate the results of bioinformatics BLASTx applications. 

Several strategies and scheduler are proposed by researchers to improve the performance of the 
MapReduce cloud computing framework proposed by Google in Dean et al. [9]. The work presented in Zhu, 
et al. [12] endures the closest similarity to our work presented here. Using CUDA codes the Map and Reduce 
worker tasks execute parallely on the GPU’s is achieved. The integration of Hadoop and GPU is achieved 
using the Hadoop Streaming, Pipes, JCuda and JNI approaches. The experimental investigation given in Jie 
Zhu et al. [12] demonstrates the effectiveness of the suggested approach on a private heterogeneous cloud 
environment using the word count application. The execution efficiency of the JCuda approach over Hadoop 
Streaming, Pipes and JNI is also proved. The major drawback of this approach is that such computational 
models are not suited for public cloud environments as GPU based VM environments aregenerally not offered. 
Public cloud environments like Amazon EC2 that offer such GPU based virtualized computing environments 
provide it at very high costs. 

Dahiphale et al. [13] described the drawbacks of the conventional MapReduce frameworks as 
followsthe MapReduce framework adopts a sequential processing of the Map and Reduce stages. The 
scalability of the MapReduce is limited. The MapReduce framework provides no support for flexible pricing 
options. The MapReduce model provides no support for computing streaming data. To overcome these 
drawbacks a pipelined model is presented to parallelize the execution of the Map and Reduce phase. The 
MapReduce model proposed in [13] is realized on the Amazon public cloud. The spot instance offering of the 
Amazon cloud allows flexible pricing. The experimental study given here demonstrates the efficiency of the 
pipelining-based MapReduce model when compared to the conventional MapReduce model using the word 
count application. The major drawback of the model proposed in [13] is that the locality optimization is not 
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considered and hosting of additional data dependent applications like Smith Waterman and other 
bioinformatics application cannot be executed in a pipelined fashion. 

The computation of certain data intensive applications like graph applications and iterative 
applications on MapReduce frameworks exhibit high computation time and cost. To support such applications, 
Google presented a proprietary cloud computing framework named Pregel that embraces the BSP computing 
model [21]. In Pregel, the graph computations are achieved using a set of super-steps. A super step is used to 
execute the user defined application or function in a parallel fashion using the data item from the database. 
Each data item from the database behaves as an agent. The Pregel system adopts vertex-centric execution 
strategy. The computation of each data item has a graph like representation in BSP. The vertexes in the Pregel 
deactivate post the computation operation and are reactivated only if additional data items are presented to 
them. Once all the vertices are deactivated the computation is said to be complete. The local storage of the data 
items in the nodes executing the computation poses a problem. In the case the data item is large then a spilling- 
to-disk technique needs to be in [23]. 

The MapReduce framework is been adopted by the Hadoop framework in Nguyen et al. [24] for 
computing on the cloud platform. The MapReduce paradigm employs a two-phase technique. The first phase 
divides the input data into little bits of data to be processed. Each small bit of data is associated with a mapper. 
The mapper outputs a<Key, Value> pair that is sorted according to the Key values. Thereducer takes these 
sorted values as the <Key | SortedList (Value)>. These sorted values are stored in the Hadoop Distributed File 
System. Schatz [25] and CloudBurst make use of the Hadoop MapReduce framework as the computing 
platform. These alignment tools perform effectively for small base pair alignments requiring single-gap or un- 
gapped alignment. However, when huge base pairs are considered, these aligners perform poorly. Because all 
present cloud framework sequence aligners take the Hadoop framework into account, Hadoop suffers when 
iterative applications are hosted on the cloud framework. When considering multiway joins, Hadoop performs 
poorly. Since Hadoop framework executes sequentially, again there is degradation in the performance. 
The BSP based Pregel framework exhibits lower computation time for selective applications. In this paper 
the BLAST — BSPMR framework incorporates the MapReduce architecture and an execution strategy is 
conducted in parallel fashion as observed in the BSP. 


2. METHOD 

A cloud platform, BLAST-BSPMR is proposed that allows to apply bioinformatics application like 
gene sequencing. The BLAST-BSPMR uses MapReduce framework which works on a cloud computing 
environment. Azure VM is the Azure infrastructure as service (IaaS) which is used to deploy persistent VMs. 
Thus, the Map and Reduce worker nodes in BLAST-BSPMR are deployed on Azure VMs. Alignment of 
genomic sequences is performed using the BLASTxmethod. Here in BLASTx, a DNA query is compared to 
a protein database. The BLAST-BSPMR algorithm performs sequence alignment in two stages, namely the 
Map and Reduce phases. SinceHadoop framework is sequential, only when the Map phase is completed the 
Reduce phase gets executed. To prevent sequential execution, BLAST-BSPMR considers parallelizing the 
Map and Reduce phases. The Map and Reduce functions are meant to be executed in parallel and to make 
optimal use of the cores available in the worker VMs. 


2.1. Basic local alignment search tool (BLAST) algorithm 

Basic local alignment search tool (BLAST) is used to compare a query sequence with the database 
sequence to find similar sequences in the database. The BLAST consists of various sequence searching 
programmes (including BLASTn, tBLASTn, BLASTx, BLASTp, tBLASTp, and tBLASTx). Each 
application varies in its functionalities, but the core of the algorithm for all of them is basically the same. The 
BLAST algorithm consists of three steps. In the initial step, called as building the word list, the query 
sequence is split intosmall words of fixed size (w-mers), often w being 11. Therefore, if the length of the 
query sequence is n, then list of n-w+1 words are being constructed. This step is illustrated in Figure 1. The 
second stage, referred to as the scan for hits, is where the BLAST algorithm looks for matches to the words 
(referred to as hits) in the database sequence. The BLAST algorithm employs the 'two-hit' strategy. If two 
non-overlapping word pairs are found to be within a distance D of one another, an extension is initiated. The 
next stage, dubbed extend the hits, involves the BLAST algorithm expanding the results in order to find 
longer related segment pairs. When the "two-hit" requirement is met, a gap-free extension in two directions is 
performed to locate an alignment known as a high-scoring segment pair (HSP). Figure 2 gives an example of 
gap-free extension. This is the seed-and-extend heuristic strategy for identifying high-scoring gene sequence 
alignments between the genomic query sequence and the database's genomic sequences. 
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Genomic Query Sequence: TCGACAGACGAGTT 


‘two-hit’ 
CGTGGGACCACCCAGACCTGCGGTTGAACTGTA 
> Word 1; TCGACAGACGA GCTGGGACCACCCAGATCTGCGGTTGAACTGAC 
> Word2: CGACAGACGAG -3-322222222222222-322222222222222-3-3 
> Word 3: GACAGACGAGT HSP 
~~~ > Word 4: ACAGACGAGTT Score: Mismatch: -3 Match: 2 
Figure 1. Build word list Figure 2. Build word list 


2.2. BLASTon the proposed parallelized MapReduce framework 

The genomic sequence database was initially split into trivial chunks of static size and scheduled 
over the virtual Azure computing nodes in the current work. Numerous computing nodes can process the 
chunks in parallel. The three core BLAST processes (building word lists, scanning, and extension) were then 
reorganised and run concurrently. It takes longer in two stages, according to the BLAST algorithm: searching 
for matching terms in the genome database and extending the seeds. Thus, the proposed BLAST-BSPMR 
framework is employed to parallelize the BLAST algorithm, thereby increasing the calculation's overall 
efficiency. The BLAST-BSPMR technique is implemented in three stages: data storage in Azure blob 
storage, genomic sequence preparation, and genomic sequence parallelization. The first stage involves 
transferring the genomic database and query sequence to Azure blob storage. The second level preprocesses 
each computational worker node's genetic data. The third stage performs further accurate matches, extension, 
and statistics on the preprocessed sequence seeds prior to giving them as input to the BLAST-BSPMR 
algorithm. The sections that follow describe the suggested method's several stages. 


2.2.1. Pre-processing of gene sequence data 

As stated in the basic procedure of the NCBI BLAST algorithm, the first criterion the BLAST 
algorithm is to generate a word list for the BLAST-BSPMR method. The purpose of identifying words with a 
high correlation coefficient during the pre-processing of genomic sequence data is to simplify calculations or 
computations concerning accurate/correct matches and seed expansion. This technique accelerates the third 
phase of sequence alignment. Due to the fact that the research query data consists of nucleic acid sequences, 
the lengths of the genomic query and database sequences are x and y, respectively, and the number of words 
is x-10 and y-10. The following stages must be followed in the pre-processing of genomic sequence data: 
Stages of mapping, sorting, shuffle, and reduction. The author generates a statistic of nucleobases (ATGC) 
with the highest occurrence number using a Map scheme on each word in the Map stage, such as <T,4>, 
where T indicates the most often occurring nucleobase and 4 denotes the number. The output of Map is 
sorted by the frequency of each nucleobase during the sort stage. In the shuffle phase, the sort stage's results 
are jumbled and merged. If the frequency of nucleobase T in the wordqs;(i = 0,11,....,10y + 1,....) in the 
genomic query sequence is num! and the count of nucleobase T in the worddb;(i = 0,11,....,10y + 
1,....) in the genomic database sequence is num2, qs;will combine with alldb;, and rank them according to 
the size of num1+num2?. List of alldbjcombined withqs;as items in the BLAST-BSPMR algorithm's word list 
in the final stage of Reduce, based on the output of the shuffle stage. The following step is to save the results 
to Azure Blob Storage for later analysis. 


2.2.2. Alignment of gene sequences in parallel 

The word list for the scanner is produced from preprocessed genomic sequencing data. The three 
basic steps in the parallel alignment of genomic sequences using a word list and a scanner are exact word 
matches, seed expansion, and statistical significance during seed expansion. To facilitate parallelization of 
genomic sequence alignment, the generated word list and scanner should be distributed among a large 
number of virtual processing nodes. The flow chart in Figure 3 depicts the parallelization of the BLAST- 
BSPMR method with MapReduce. 
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Figure 3. The Parallelized BLAST-BSPMR model 


2.2.3. The steps required in aligning the BLAST-BSPMR method in parallel 

a) Each Azure virtual computer node will receive the list of words and scanner, and the gene sequence 

willbe aligned using a BLAST-BSPMR MapReduce Job. 

During the Map phase, each Map task reads the scanner and a list of words from local memory to build 

accurate matches for items with a high correlation; items that satisfy the matching conditions are saved as 

words in the following extension. 

c) Following scanning, the restored bits from second step 2) will go through three processes in 
eachcomputing node: they will initiate a simple match extension without space, they will use DP to find 
amatch extension with vacancies, and they will finally obtain alignment results that satisfy the 
givenconditions. 

d) The results of the comparison will be arranged according to the score acquired during the shuffle phase. 

e) The Reduce phase transfers the genomic sequence alignment findings to Azure Blob Storage in their 
original format. 


b 


wm 


2.3. Proposed BLAST alignment on the parallelized MapReduce model BLAST-BSPMR 

Let DB denote a bioinformatics sequence of the genome database and Qs the genomic query 
sequence. The BLAST — BSPMR is positioned on a public cloud platform namely azure consist of a master 
node, Map and Reduce worker computing nodes. The master computing node of BLAST — BSPMR initializes 
cw Map and Reduce worker computing nodes using the virtual computing nodes. Every virtual computing 
node is presumed to CCPU cores available for task/job computation. Let TEyy inzrepresent the time spent 
initialising the virtual platform. The sequence DB is a bioinformatics database that is divided intoDB’tiny 
chunks of sequence data with overlapping portions. A database's short bit of data sequence and the Qs are 
then transmitted as input key, value pairs to the Map computing nodes. The key value pairings that take into 
account the reference small bit of data DB’are denoted as (kd, vd), where kd is the key and vd which 
contains the genomic data with overlapping offsets. The key-value pair corresponding to the genome query 
sequence Q8 is indicated by (kqs, vqs), where kqa is the key and vqs is the genome query sequence. Each 
of the cw computing Map workers divides the query sequence Qs into Qa’little bits of data and stores them in 
the available local memory storage. The sequence alignment is performed using the BLASTx algorithm in 
parallel using theCcores taking into account DB’and each and every Qs’. The BLASTx algorithm is 
parallelized to reduce the time required to complete genomic sequence alignments. Let TEms denote the 
average time taken by the cw Map compute nodes to complete a task. The mappers give alignment positions 
between Qs and database small bit of data of DB’ along with the computed score as a result of their post 
computations. Multiple alignment places and computed score, i.e., vms combined with the small bit of data 
id of DB'i.e., kms are saved in the memory in the following manner: list(kms, vms). The BLAST- 
BSPMR  Mapfunction is represented as map((Ad, wd), (kqhqs,vgs)) > list kmp, vomp). 
The cw Map computing worker nodes obtain intermediate genomic data i.e. list (kmp, mp) and perform 
shuffle and sort operations. The reduce step considers the collection of all non-overlapping and non-redundant 
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alignment results, i.e., list(agd). Allow TEps is the average time spent by the cw Reduce computing worker 
nodes to complete the storage procedure (gather the results). The total time required for the BLAST-BSPMR 
to align the genomic sequence using a cloud-based platform Q against the database, DB is calculated as, 


E = TEvyy inz + TEws + TErs x (1) 


the BLAST-BSPMR cloud platform is depicted in Figure 3. 


3. RESULTS AND DISCUSSION 

The machine was configured with a 64-bit version of Windows 10 Enterprise, 16 GB of RAM, and an 
i-5 quad core processor. We conducted an experimental investigation comparing the proposed BLAST-BSPMR 
gene sequencing model to the existing Hadoop-BLAST sequencing model on the following parameters: 
speedups, throughput, and sequence alignment completion time using the dot net framework 4.0 and C# 6.0 
programming language for the proposed work and java programming language for the existing Hadoop. 

It is considered to compare the performance of BLAST-BSPMR with Hadoop-BLAST. With a 
single VM computing node, Hadoop-BLAST and BLAST-BSPMR can be deployed. The Microsoft Azure 
cloud platform hosts the BLAST-BSPMR. The Hadoop-BLAST application is built on top of the Hadoop 
MapReduce framework. The Hadoop-BLAST is deployed using Apache Hadoop & YARN 2.6.0. In the 
deployments, identical configurations of VM computing nodes are taken into account. TheBLAST — 
BSPMR model is hosted on Microsoft's Azure cloud considering A3 VM instances. Each A3 VM instance 
consists of 4 virtual computing cores, 7 GB of RAM and 120 GB of local hard drive space. TheBLAST — 
BSPM Rmodel deployed on the Azure cloud platforms are comprised of a master node and a worker node for 
the purpose of performing Map and Reduce operations/tasks. Using Azure HDInsight. It enables deployment 
and provisioning of Apache Hadoop clusters on the Azure cloud platform. The Apache Hadoop & YARN 
version 2.6.0 is considered for performance evaluation. The master node of the azure cluster runs on the 
Windows Server 2012 R2 operating system. One worker node of A3 VM instances is considered for the 
Hadoop deployment. For evaluation, the non-redundant protein genomic database is used. The initial 
investigations employ a continuous non-redundant protein genomic sequence and four nucleotide query 
sequences of varying sizes. Thefirst experiment conducted with the reference and varied query genomic 
sequences are summarized in Table 1. Figure 4 shows the execution time and Figure 5 shows the speedup 
and throughput and Figure 6 shows the average speedup and throughput. The results show that the proposed 
BLAST-BSPMR aligner, which is running on Azure, outperforms Hadoop-BLAST. For both BLAST- 
BSPMR and Hadoop-BLAST, the execution time increases as the query file length grows. The second 
experiment is conducted using varied non-redundant protein genomic sequence and constant query sequences 
of nucleotide. The total time (including the Map and Reduce stages) taken to execute the alignment is 
monitored using Table 2 for the BLASTx sequence alignment computation on the BLAST-BSPMR and 
Hadoop-BLAST. The results obtained are summarized in Table 3. The total time required to complete 
sequence alignment (as depicted in Figure 7) is recorded. The results show that the proposed BLAST- 
BSPMR aligner, which is running on Azure, outperforms Hadoop-BLAST. Figure 8 shows the speedup and 
throughput for the experiment 1 and Figure 9 depicts the average speedup and throughput obtained for the 
experiment 2. As the database file size increases, the execution time also increases for both BLAST-BSPMR 
and Hadoop-BLAST. 


Table 1. Experiment data used to compare BLAST-performance BSPMR's to that of BLAST-Hadoop with 
varied size of query genome 


No Database Genome Size (GB) Query Genome Size (kb) 
1 nr.01(non-redundant protein) 2.99 Nucleotide sequence 16 
2 nr.01(non-redundant protein) 2.99 Nucleotide sequence 32 
3 nr.01(non-redundant protein) 2.99 Nucleotide sequence 64 
4 _ nr.01(non-redundant protein) 2.99 Nucleotide sequence 128 


Table 2. Experiment data used to compare BLAST-performance BSPMR's to that of BLAST-Hadoop with 
varied size of database genom 


No Database genome Size (GB) Query genome Size (kb) 
1 nr.40 (non-redundant protein) 0.078 Nucleotide sequence 128 
2 nr.04 (non-redundant protein) 0.299 Nucleotide sequence 128 
3 nr.05 (non-redundant protein) 1.075 Nucleotide sequence 128 
4 nr.01 (non-redundant protein) 2.99 Nucleotide sequence 128 
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Figure 6. Average speedup and throughput 


Table 3. Speedup and throughput for varied query genome 


Sequence Alignment Execution Time for 
varied Database 


nr.40 nr.04 nr.0S 


Protein Database 


Seg. File Size Speedup Throughput 
16 KB 4.1 76% 
32 KB 2.6 62% 
64 KB 2.14 53% 
128 KB 2.07 53% 

Average 2.72 51% 

0.078 GB 7.067 85.0% 
0.299 GB 335 70.0% 
1.075 GB 2.16 53.8% 
2.99 GB 2.07 51.8% 
Average 3.66 65.15% 
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Figure 5. Speedup and throughput 
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4. CONCLUSION 

Aligning genomic sequences is a straightforward technique for managing and analysing 
bioinformatics data. The author describes the proper methods for aligning sequences using the BLAST 
algorithm, which is the most widely used tool for aligning local sequences. Currently, the BLAST algorithm 
given by NCBI (stand-alone) is unable to handle dynamic biological data in the terabyte range. Cloud 
platforms are used to address issues such as data storage and data-intensive computation. Biosequencing 
analysis of genetic data is a critical application. In this study, the author evaluates many existing schedulers. 
The existing scheduler has issues which are expensive with aligning genomic data sequences. Existing 
bioinformatics sequence aligners that make use of or embrace Hadoop MapReduce suffer from the concerns 
discussed in this study. In this paper, the proposed BLAST-BSPMR algorithm is employed to align 
sequences. For biosequence alignment, the BLASTx algorithm is used, and a parallel MapReduce execution 
approach based on Azure cloud is used in the BLAST-BSPMR cloud platform. The Map and Reduce 
framework are executed in parallel to take use of the cloud computing platform's virtual machine-based 
design. Additionally, the research compares the proposed BLAST-BSPMR technique to previously published 
system sequence alignments. Experiments are carried out with a variety of non-redundant proteins and 
nucleotide query sequence files of various sizes. Experiments are presented to demonstrate the efficiency of 
the BLASTx algorithm. Through an experimental study, a comparison with Hadoop-BLAST for sequence 
alignment is presented. When comparing the BLAST-BSPMR results to the Hadoop-BLAST results, the 
BLAST-BSPMR results show a significant improvement. The authors plan to test their theory on a variety of 
redundant databases in the future such as SwissProt protein, and REFSEQ and also run varied application 
that are available in NCBI such as BLASTn, and BLASTp on our proposed BSPMR Scheduler to further 
analyse the robustness and efficacy of our scheduler. 
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